Permutation Tests for

Factorial ANOVA Designs

David C. Howell, University of Vermont

GreenBlueBar.gif GreenBlueBar.gif

Permutation tests have been around statistical analysis for nearly 75 years, and a wide variety of tests have been constructed. You would probably think that there are all sorts of algorithms to set up permutation procedures for factorial designs, but you would be wrong. Although many people have discussed them, Anderson (2001) and Manly (2007) being two of the best discussions, when a factorial design includes a significant interaction things become sticky. Edgington and Onghena (2007) point out that an exact test for interaction can not be created. However there are good approximate tests that we can use. This document will cover proposals by a number of authorities.

As I indicated above, Anderson (2001) has one of the best discussions of this kind of test. She is certainly not the first, and her work derives from work by Edgington, Manly, and others, but she has pulled much of it together. I started by writing this page around her work, but found that it was a bit easier to write it around Manly's book, although you come out at the same place in the end. I am creating this page mainly for myself. I have written the necessary code in R to implement many of Manly's alternatives, but I feel a need to pull that stuff together and make sure that I understand what I am doing. So be patient!

Example

I will start with an example taken directly from Manly (p. 144). It concerns the number of ants consumed by two sizes of lizards over each of four months. (Well, I was getting tired of examples from psychology, and who couldn't get excited by prying open the mouths of lizards and counting the number of ants they have just eaten.)

SmallLarge
June 13 242 105182 21  7
July  8  59   2024 312 68
August515 488 88460 1223 990
September18   44  21140   40    27

The cell means are

                 Small    Large
     June      120.00000  70.0000
     July       29.00000 134.6667
     August    363.66667 891.0000
     September  27.66667  69.0000

and it looks as if we have the possibility of an interaction (In June the Small guys actually ate more ants that the Large guys, but the reverse happened in other months.) It also looks as if we have might have an effect for Size, with Large guys eating more ants than Small guys. (Who would have guessed?) In fact, the analysis of variance summary table for these data confirms that. The interaction is very borderline, as is the Size effect, but the Months effect is substantial.

The means for Size and Months follow, along with the ANOVA summary table.

  
            Small    Large
     [1,] 135.0833 291.1667 
June July August September [1,] 95 81.83333 627.3333 48.33333

         The standard ANOVA for these  data follows:

                  df Sum Sq Mean Sq   Fvalue  Pr(>F)    
     size         1  146172  146172  4.4699   0.05055   
     months       3 1379495  459832 14.0615 9.487e-05 ***
     size:months  3  294009   98003  2.9969   0.06171   
     Residuals   16  523222   32701                      

Permutation Test Approach

I am assuming that the reader will know about permutation tests and the general procedures lying behind them. If you don't, I recommend going to www.uvm.edu/~dhowell/StatPages/index.html.

The basic idea with a one-way Anova design is that if the null hypothesis is true, an observation in Group 1 could just as easily have fallen in Group 2, and vice versa. So we simply permute the observations across groups, calculate an F, repeat this perhaps 5000 times, and find the percentage of repetitions in which the calculated values of F exceeded the Fs obtained from the original data. This is the p-value under the null hypothesis.

The above description applies easily to the case of a one-way Anova or a t test, where it is obvious how permutations should be done. However things get messy when it comes to a factorial Anova because it is not always obvious what should be permuted and over what cells the permutation should take place. I will develop that idea in a moment. But first I need to discuss the assumptions behind our tests.

If we have a true experiment, where observations (subjects) are assigned to the conditions at random, then life is simple. We only consider the case of the null hypothesis in forming our sampling distribution of outcomes, and under the null hypothesis random assignment will insure that the test is valid. However when we come to observational studies things are a bit messier. In our example here, it is quite clear that we cannot randomly assign subjects to the Small versus Large sizes. In this case we must assume that the observations are exchangeable, meaning that it is reasonable for a score to fall in either group under the null. If the observations are independent, and if errors come from the same distribution (i.e., they are identically and independently distributed --iid) then observations are exchangeable. You will notice that this is a much milder assumption than we make with traditional parametric tests. (The data do not have to be normal (and these data certainly are not) or have any other specific distribution. But notice that requiring the errors of observations to be iid brings in the traditional assumption of homogeneity of variance.

There are a number of different approaches to analyzing factorial designs, but the first problem concerns the interaction. In the case of an interaction there is no exact test that we can use. (An exact test is one that has a probability of a Type I error of exactly .05, for example, under repeated sampling when the null is true.) But we can develop tests which are approximately exact, and that is very likely to be more than adequate for our needs except in extreme cases.

We need to run our test so that we test our interaction in such a way as to control for main effects. One way to control for main effects is to restrict how we do our permutations. For example, if we want to control for a main effect of Size, we could restrict our permutations so as to only permute among months within each Size group. Then any Size effects, if present, will not influence the results of resampling. For example, we could permute observations between Months for Small lizards and then do the same thing for Large lizards, but we could not permute a score from one lizard size into another. This would help to control for effects of Months, but we still have potential effects of Size. If we further restrict permutations to stay within each Size x Month combination, we would obtain the same F for all allowable permutations, and that wouldn't get us far. So we need to try something else.

First we need to read in the data. In R the commands would be

ants <- c(13, 242, 105, 182, 21, 7, 8, 59, 20, 24, 312, 68, 515, 488, 88,
460, 1223, 990, 18, 44, 21, 140, 40, 27)
  #These next two commands merely set up
  #the factors of size and months as 1s and 2s or 1, 2, 3, 4.
size <- as.factor(rep(1:2, each = 3, times = 4)) 
months <- as.factor(rep(1:4, each = 6))

Manly's approach -- Unrestricted Permutation of Observations

One way to permute our data would be to randomize them over all cells in the experiment. For example, we could take our 24 values, shake them up in a hat, and write them down in whatever order they came out of the hat next to the columns that contain information on Size and Months. We would calculate the F for our three effects, store those values away, and then repeat this procedure another 4999 times. This would leave us with 5000 values of FSM, FS, and FM that would reasonably occur under the null hypothesis. We then compare our obtained F against that distribution and calculate the percentage of replications under the null hypothesis when the resampled Fs exceed the obtained F. We would do the same thing with the two main effects. This is the procedure recommended by Manly (2007), and is perhaps the easiest to carry out. The R code and the resulting probability values follow.


# Standard Anova on these data
mod1 <- lm(ants ~ size + months + size:months)
ANOVA <- summary(aov(mod1))
cat( " The standard ANOVA for these data follows ","\n")
Fsize <-  ANOVA[[1]]$"F value"[1]   # Saving F values for future use
Fmonths <-  ANOVA[[1]]$"F value"[2]
Finteract <-  ANOVA[[1]]$"F value"[3]
print(ANOVA, "\n")
cat( "\n")
cat( "\n")

print( "Resampling as in Manly with unrestricted sampling of observations. ")
# Now start resampling
nreps <- 5000 
FS <- numeric(nreps)    #Set up space to store F values as calculated.
FM <- numeric(nreps)  
FSM <- numeric(nreps)
FS[1] <- Fsize          # The first F of our 5000 
FM[1] <- Fmonths
FSM[1] <- Finteract
for (i in 2:nreps) {
  newants <- sample(ants, 24)
  mod2 <- lm(newants ~ size + months + size:months)
  b <- summary(aov(mod2))
  FS[i] <- b[[1]]$"F value"[1]
  FM[i] <- b[[1]]$"F value"[2]
  FSM[i] <- b[[1]]$"F value"[3]
  }
probS <- length(FS[FS >= Fsize + .Machine$double.eps ^0.5])/nreps
probM <- length(FM[FM >= Fmonths+ .Machine$double.eps ^0.5])/nreps       
probSM  <-  length(FSM[FSM >= Finteract + .Machine$double.eps ^0.5])/nreps
### The addition of "+ .Machine$double.eps" is an aid against two numbers that differ only by
### floating point computer calculations at the extreme.

cat(" The probability value for the interaction is ",probSM, "\n")
cat(" The probability value for Size is ", probS, "\n")
cat(" The probability value for Months is ", probM, "\n")

Manly's Approach--unrestricted permutation of observations.

 The probability value for the interaction is  0.048 

 The probability value for Size is  0.0458 

 The probability value for Months is  2e-04 
 
###################################################################

Edgington's approach -- Restricted permutation of main effects

Edgington (2007) goes at the problem from a different direction. He maintains that there is no exact test for interactions, but suggests that you can get an "indication" of the presence of interactions by testing the interaction in the same way that Manly did. But he only tests the interaction that way. In other words Edgington permutes all observations across all cells 5000 times, computes the F for the interaction for each one, and then calculates the percentage of cases where the F on the permuted data exceeded our obtained F.

But what about the main effects? Well, there are two ways to deal with them. If the interaction is significant, you probably don't care about the main effects. I have been a holdout arguing that it is OK to look at the main effects if you have a good reason, but I am coming to the view that you really seldom have a good reason to want to deal with the main effects in the face of an interaction. I am now joining the rest of the statistical community. Look at simple effects if you want to, but skip the main effects.

But if the interaction effect is not significant you would probably want to go ahead and deal with main effects. One way to do this is the same way that Manly does. But Edgington would argue that if you don't have an interaction, your best model is Yijk = μ + Si. + M.j + εijk, which is an additive model--it does not have an interaction. (This is the approach that SPSS calls Type II SS, although SPSS does not make it conditional on a nonsignificant interaction.) Edgington reasons that if this is now your model you don't have to adjust for an interaction. Therefore you can test Size by shuffling the data for each month separately between size categories, and then test Months by shuffling the data for each size among the month categories. In each of these steps you form the distribution of the randomized Fs and compare your obtained Fs against those distributions. The code for this, and the resulting output, is shown below.

# Now moving to Edgington
#    Edgington takes the interaction to be the same as Manly above.
#    To test Size and Month he uses restricted randomization and the main effect Fs.

cat( "\n")
cat( "\n")
print( "Resampling as in Edgington with restricted sampling for main effects. ")
cat("Edgington's test for interaction has probability = ", probSM, "\n which is
the same as Manly's.")

# I am now going to work with an additive model.
# Main effect of Size
FS <- numeric(nreps)
FS[1] <- Fsize
for (i in 2:nreps) {
M1 <- ants[1:6]      # Combining across size within month
M2 <- ants[7:12]
M3 <- ants[13:18]
M4 <- ants[19:24]
temp1 <- sample(M1,6, replace = FALSE)
temp2 <- sample(M2,6, replace = FALSE)
temp3 <- sample(M3,6, replace = FALSE)
temp4 <- sample(M4,6, replace = FALSE)
newants <- c(temp1, temp2,temp3, temp4) 
mod5 <- summary(aov(newants ~ size + months )) 
FS[i] <- mod5[[1]]$"F value"[1] } 
probS <- length(FS[FS >= Fsize + .Machine$double.eps ^0.5])/nreps 
cat("The probability for the effect of Size is ", probS,"\n") 

# Main effect of Months
FM <- numeric(nreps) 
FM[1]  <- Fmonths 
for (i in  2:nreps){
  S1 <- ants[size == 1] 
  S2 <- ants[size == 2]   
temp1  <- sample(S1, 12, replace = FALSE) 
temp2  <- sample(S2, 12, replace = FALSE)
newants  <-  c(temp1,temp2)
mod6<- summary(aov(newants ~ size + months)) 
FM[i] <- mod6[[1]]$"F value"[2] 
  } 
probM <- length(FM[FM >= Fmonths + .Machine$double.eps ^0.5])/nreps
cat("The probability for the effect of Months is ", probM, "\n")
 

[1] "Resampling as in Edgington with restricted sampling for main effects. "

Edgington's test for interaction has probability =  0.048 

The probability for the effect of Size is  0.04 

The probability for the effect of Months is  2e-04 
###############################################################################

Still and White's Permutation of Residuals

Still and White (1981) follow Edgington in using restricted randomization for main effects, but they control for main effects by using residuals. In this case the residuals are what is left over after you remove row and column effects. If you look at all of your data combined, there are potential main effects of Size and Months included in them. But if we subtract out those effects and use the residuals that result, we can test the interaction without worrying about main effects. Because we have removed any possible row and column effects, the residuals are exchangeable under the null hypothesis. One simple way to do this is to compute rijk = Yijk - Ybari. - Ybar.j + Ybar... In other words for each observation in Row1 and Column2, we subtract out the Row1 mean and the Column2 mean and add back in the grand mean. We do the analogous thing to observations in the other cells. If you now ran a complete analysis of variance on the residuals you would find that the sums of squares for Size and Months were exactly 0, and the F for the interaction would be a fair test of that interaction uncontaminated by main effects.

The procedure just described is the easiest way to describe what Still and White are doing in computing their interaction, but I can accomplish the same thing by fitting an additive model (a model with no interaction term) to the data and asking my program to calculate residuals. In other words, if you were using SPSS, for example, you would just tell it to run a two-way Anova but specify that the model does not contain the interaction term. (Just click on the "model" button.) Then ask it to save the residuals and use those residuals as the dependent variable in a two-way with the interaction included. You will get the same result either way you do it, but I think that the second is easier to code.

But we only do this for the interaction. The main effects will not be calculated on these residuals because they would come out to be zero. When it comes to the main effects, Still and White follow Edgington's lead and will obtain the same results (within sampling error). Thus we don't have to recalculate those results because we will just steal them from Edgington. The code and the output follow.

# Now moving to Still and White (1981)
#  They test the interaction using residuals of the form
#  Y[ij] - Ybar[i,] - Ybar[.j] + Ybar[..]
# The easiest way to do this is to get the residuals from an additive model.
mod2 <- lm(ants ~ months + size)
res <- mod2$residuals   # These are the residuals from the additive model
# Now do the resampling to get the interaction F.
SHint <- numeric(nreps)
SHint[1] <- Finteract
for (i in 2:nreps) {
  newants <- sample(res, 24, replace = FALSE) 
  mod7 <- summary(aov(newants ~ size + months + size:months)) 
  SHint[i] <- mod7[[1]]$"F value"[3]
    } 
probInt <- length(SHint[SHint >= Finteract + .Machine$double.eps ^0.5])/nreps  
       
cat( "\n")  
cat("\n") 
print("Resampling as in Still and White with unrestricted sampling. ") 
cat("The probability for the effect of Interaction is ", probInt, "\n") 
# Still and White have the same main effects as Edgington if there is no interaction. 
cat("The probability for the effect of Size is ", probS, "\n") 
cat("The probability for the effect of Months is ", probM, "\n") 
########################################################################### 
 "Resampling as in Still and White with restricted sampling for main effects " 
 and randomization of residuals for the interaction." 
The probability for the effect of Interaction is 0.0512    
 
# Still and White have the same main effects as Edgington
The probability for the effect of Size is  0.04 
The probability for the effect of Months is  2e-04 

ter Braak's method (1992)

ter Braak has done a great deal of work with randomization procedures and he advocates an approach similar to the Still and White approach except that in calculating the interaction the residuals are taken over the whole design rather than just the additive model. This amounts to removing the cell mean from each observation, to create a set of residuals, and then permute those residuals across all cells of the design. This is very much like the Still and White approach but they only subtracted row and column means, not individual cell means. Again this amounts to fitting a complete model (including the interaction) and computing the residuals. The code and the results of this analysis follow.

# Ter Braak creates residuals from cell means and then permutes across
# all cells
#  This can be accomplished by taking residuals from the full model

mod9 <- lm(ants ~ months + size + months:size)
res2 <- mod9$residuals
TBint <- numeric(nreps)
#  We don't have to calculate the effects of Month and Size, because
#  these come from the calculations for Edgington.

TBint[1] <- Finteract
for (i in 2:nreps) {
  newants <- sample(res2, 24, replace = FALSE) 
  mod10 <- summary(aov(newants ~ size + months + size:months)) 
  TBint[i] <- mod10[[1]]$"F value"[3]
  } 
probInt <- length(TBint[TBint >= Finteract  + .Machine$double.eps ^0.5])/nreps

cat( "\n")
cat( "\n")
print( "Resampling as in ter Braak with unrestricted sampling of cell residuals. ")
cat("The probability for the effect of Interaction is ", probInt, "\n")   
cat("The probability for the effect of Size is ", probS, "\n")  
cat("The probability for the effect of Months is ", probM, "\n") 


"Resampling as in ter Braak with unrestricted sampling of cell residuals. "
The probability for the effect of Interaction is  0.0488 
The probability for the effect of Size is  0.04 
The probability for the effect of Months is  2e-04 


Summary

The solutions given here are not the only possibilities, but they cover the important ones. The paper by Anderson (2001) is excellent and offers a number of recommendation based on power and error rates coming from on a related study by Anderson and ter Braak (unpublished). She basically suggests using a test such as Manly's for the interaction, where you use unrestricted permutation of raw observations over all cells. This is particularly helpful with small sample sizes. When it comes to testing main effects, Anderson suggests that permutation of residuals under the reduced model. This amounts to calculating residuals from an additive model and then randomizing them the same way that Edgington would randomize observations--i.e., restricting the reandomizations to individual levels of the other variable.

We can summarize the results of these different analyses by setting up a table showing the obtained p-values. This table follows. You will notice two things. First, it does not make a huge difference which approach you take, although there are differences. Second, some of the probabilities are exactly equal because for those terms the different approaches do the same thing. For a good overview of this whole field, going beyond ANOVA into partial regression, I recommend Anderson's paper.

Summary of results in p-values
Effect
Parametric F
Manly
Edgington
Still-White
ter Braak
Size
.0506
.0485
.0400
.0400
.0480
Months
.0000
.0000
.0000
.0000
.0000
SxM
.0617
.0480
.0480
.0512
.0488

Nested Designs

(This should be a separate page, but I haven't gotten around to that yet.) So far I have only spoken about standard factorial designs, often called crossed designs. In these cases every level of the row variable is paired with every level of the column variable. But we sometimes have what are called "nested designs," in which we don't have the same kind of pairing. We need a way to deal with those designs as well.

Suppose that we want to take 5 male therapists and 5 female therapists and have each of them provided a numerical rating of each of 10 clients. You don't have the same type of crossing of therapists with gender because Mary cannot serve as a male therapist and John can't serve as a female therapist. So we can't calculate an interaction of therapist x gender. Instead, we will have a term for Gender, a term for Therapist(Gender), read "therapist within gender," and a term for within cell error. The Therapist(Gender) term will actually be the sum of the simple effect of Therapist at each level of Gender.

The following data can be used to illustrate the problem.


Therapist

1
2
3
4
5
Male
9
8
6
8
10
4
6
5
7
7
7
9
6
6
6
11
6
3
8
7
11
13
8
6
14
11
13
13
10
11
12
11
16
11
9
23
12
10
19
11
10
19
14
5
10
11
14
15
11
11

Therapist

6
7
8
9
10
Female
8
6
4
6
7
6
5
7
9
7
10
7
8
10
4
7
10
6
7
7
14
11
18
14
13
22
17
16
12
11
20
16
16
15
18
16
20
22
14
19
21
19
17
15
22
16
22
22
18
21

It is difficult to produce the standard analysis of variance summary table for a nested design in R. (At least it is difficult for me.) But you can get the appropriate result (sort of) if you use the command
a <- summary(aov(dv ~ gen + th/gen )).
The output is shown below

            Df  Sum Sq Mean Sq F value    Pr(>F)    
gender       1  240.25 240.250  29.936 3.981e-07 ***
therapist    8 1705.24 213.155  26.559 < 2.2e-16 ***
Residuals   90  722.30   8.026                      
---

BUT that is not quite what you want. Gender should not be tested against the residuals. You can compute your F for Gender by dividing by MS(Therapist(Gender)). You can get the F for Therapist(Gender) if you divide Therapist by Error: within (also shown as Residuals). The correct result is given below.

Source              df           SS     MS        F 
Gender               1     240.25     240.25      1.12711 
  Error1             8    1705.24     213.155     
Therapist(Gender)    8    1705.24     213.155    26.5595 
  Error2             90     722.3     8.02556 


As you can see from the summary table, there is clearly no effect due to Gender, but there is a significant Therapist within Gender effect. This is not particularly exciting because all that it really says is that therapists are not all alike, which finding will not likely win us the Psychologist of the Year award.

It is relatively easy to create the necessary code for a permutation test on these data. What is important here is to recognize that the permissible randomizations depend on the effect being tested. For example, to test differences between therapists nested within genders [Therapist(Gender)] it does not make sense to redistribute scores from a male therapist with scores from a female therapist because therapists are nested within treatment. So what we will do is to restrict our randomization of observations to stay within the same category of Gender. But to test the effect of Gender, we will randomly allocate "whole" therapists. By that I mean that we will take the 10 scores for each therapist and move them as a whole from one Gender to another. To quote Anderson, with suitable changes in variable names,

"If the null hypothesis were true, the variability due to Therapists within Genders would be similar to the residual variation. That is, variability due to Therapist is exchangeable with residual variation under the null hypothesis. If SStotal and SSGender remain constant under permutation, then the shuffling strategy is 'mixing' variability only between SST(G) and SSresidual, which gives an exact test for differences among genders."

Resampling approach

# Resampling Nested Designs
dv <- c(9,8,6,8,10,4,6,5,7,7,
7,9,6,6,6,11,6,3,8,7,
11,13,8,6,14,11,13,13,10,11,
12,11,16,11,9,23,12,10,19,11,
10,19,14,5,10,11,14,15,11,11,
8,6,4,6,7,6,5,7,9,7,
10,7,8,10,4,7,10,6,7,7,
14,11,18,14,13,22,17,16,12,11,
20,16,16,15,18,16,20,22,14,19,
21,19,17,15,22,16,22,22,18,21)

ther <- as.factor(rep(1:10, each = 10))
gender <- as.factor(rep(1:2, each = 50))
n <- 10
N <- length(dv)
g <- nlevels(gender)
tt <- nlevels(ther)

# F values from Nested.R
mod1 <- summary(aov(dv ~ gender*ther))
Fthwingen1 <- mod1[[1]]$"F value"[2]
Fgen1 <- mod1[[1]]$"Mean Sq"[1]/mod1[[1]]$"Mean Sq"[2]
Fthwingen <- numeric(nreps)
Fgen <- numeric(nreps)
# First we will compute the Gender effect
#  Randomly shuffle Therapists across Gender
#    Make a matrix containing therapists on columns
dvmat <- matrix(dv, nrow = tt, byrow = TRUE)

for (i in 1:nreps) {
  worms <- dvmat[sample(1:10,10),]   #worms???
  newdv <- as.vector(t(worms))
  a <- summary(aov(newdv~gender*ther))
       # This gives us the appropriate terms because it removes
       # the 1 df that would go to the interaction from th, giving ther(gender)
       # But if you enter the terms in the other order it won't work.
  MSgen <- a[[1]]$"Mean Sq"[1]
  MSth <- a[[1]]$"Mean Sq"[2]
  MSerrorwin <-  a[[1]]$"Mean Sq"[3]
  Fgen[i] <- MSgen/MSth

  S1 <- dv[gender == 1]
  S2 <- dv[gender == 2]
  temp1 <- sample(S1, length(S1), replace = FALSE)
  temp2 <- sample(S2, length(S2), replace = FALSE)
  newdv2 <- c(temp1, temp2)
  mod6<- summary(aov(newdv2 ~ gender * ther))
  Fthwingen[i] <- mod6[[1]]$"F value"[2]
  #Fgen[i] <- MSgen/  mod6[[1]]$"Mean Sq"[2]
  }
  
 probgen <- length(Fgen[Fgen >=  Fgen1])/nreps
 probth <- length(Fthwingen[Fthwingen >= Fthwingen1])/nreps
 par(mfrow = c(2,2))
 hist(Fthwingen, breaks = 50)
 hist(Fgen, breaks = 50)
 # This works, but the question is what is the denominator to test gender.
 # It should be MS(ther(gender)), but that is calculated with a different randomization.
 # I have used MSther(gender) from first analysis.
 

References

Anderson, M. J. (2001) Permutation tests for univariate or multivariate analysis of variance and regression. Canadian Journal of Fisheries and Aquatic Science, 58, 626-639.

Edgington, E. S. and Onghena, P. (2007) Randomization Tests (4th ed.)London, Chapman & Hall.

Manly, B. F. J. (2007) Randomization, Bootstrap, and Monte Carlo Methods in Biology (3rd ed.), London: Chapman & Hall.

ter Braak, C. J. F. (1992) Permutation versus bootstrap significance testss in multiple regression and ANOVA. In Bootstrapping and Related Techniques. (K. J. Jockel, Ed.), Springer-Verlag, Berlin, pp. 79-86.

GreenBlueBar.gif GreenBlueBar.gif



dch: